The consumer credit department of a bank wants to automate the decision-making process for the approval of home equity lines of credit. To do this, it will follow the recommendations of the Equal Credit Opportunity Act to create an empirically derived and statistically sound credit scoring model. The model will be based on data collected from recent applicants granted credit through the current loan underwriting process. It will be built with predictive modeling tools, but must remain interpretable enough to provide a reason for any adverse action (rejection).
The Home Equity dataset (HMEQ) contains baseline and loan performance information for 5,960 recent home equity loans. The target (BAD) is a binary variable indicating whether an applicant eventually defaulted or was seriously delinquent. This adverse outcome occurred in 1,189 cases (20%). For each applicant, 12 input variables were recorded.
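As a quick arithmetic check on the class balance quoted above (using only the counts stated, not the file itself):

```python
# class balance quoted for the HMEQ target (BAD)
n_total = 5960
n_bad = 1189

bad_rate = n_bad / n_total
print(round(bad_rate * 100, 1))  # ~19.9%, i.e. roughly 20% adverse outcomes
```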
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.express as px
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
data = pd.read_csv('hmeq.csv')
data.shape
data.isnull().sum()
data.head()
# data overview (especially on nan & nan percentage)
listItem = []
for col in data.columns:
    listItem.append([col, data[col].dtype, data[col].isna().sum(),
                     round((data[col].isna().sum() / len(data[col])) * 100, 2),
                     data[col].nunique(), list(data[col].drop_duplicates().sample(2).values)])
dfDesc = pd.DataFrame(columns=['Data Features', 'Data Type', 'Null Count', 'Null %', 'N-Unique', 'Unique Sample'],
                      data=listItem)
dfDesc
## rows with nulls
len(data[data.isnull().any(axis=1)])
## rows without nulls
len(data[~data.isnull().any(axis=1)])
## the dataset with nulls
data[data.isnull().any(axis=1)]
## correlate null features
coba = data.copy()
coba['MORTDUE NULL'] = coba['MORTDUE'].isna()
coba['VALUE NULL'] = coba['VALUE'].isna()
coba['REASON NULL'] = coba['REASON'].isna()
coba['JOB NULL'] = coba['JOB'].isna()
coba['YOJ NULL'] = coba['YOJ'].isna()
coba['DEROG NULL'] = coba['DEROG'].isna()
coba['DELINQ NULL'] = coba['DELINQ'].isna()
coba['CLAGE NULL'] = coba['CLAGE'].isna()
coba['CLNO NULL'] = coba['CLNO'].isna()
coba['DEBTINC NULL'] = coba['DEBTINC'].isna()
plt.figure(figsize=(20, 20))
sns.heatmap(coba.corr(method='pearson', numeric_only=True), annot=True)
# trying to find the null combinations
pd.options.display.max_rows = None
null = []
for idx in range(len(data)):
    temp = []
    for i in data.columns:
        if pd.isna(data[i].iloc[idx]):
            temp.append(i)
    temp.sort()
    null.append(tuple(temp))  # tuples are hashable, so value_counts() below works
print(len(null))
(pd.Series(null)).value_counts()
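The row loop above can also be expressed as a vectorized sketch: build one boolean missingness mask and count identical patterns. A small synthetic frame stands in for the HMEQ data here:

```python
import numpy as np
import pandas as pd

# synthetic frame standing in for the HMEQ data
toy = pd.DataFrame({'A': [1, np.nan, 3, np.nan],
                    'B': [np.nan, np.nan, 1, 2],
                    'C': [1, 2, 3, 4]})

# each row becomes a tuple of the column names that are missing in it
patterns = toy.isna().apply(lambda row: tuple(sorted(toy.columns[row])), axis=1)
print(patterns.value_counts())
```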
From the observations:
# creating data contingency
mydata = data.copy(deep=True)
# drop debtinc & derog columns because > 10% nan
mydata = mydata.drop(['DEBTINC','DEROG'], axis = 1)
# drop nan rows to eliminate noise
mydata = mydata.dropna()
# resetting index after dropna
mydata = mydata.reset_index(drop=True)
# final data shape
mydata.shape
# saving mydata (clean data) to csv for further usage
mydata.to_csv('mydata.csv', index= False)
# final nan values checking
mydata.isnull().sum()
# creating a pairplot with my target variable as the hue
sns.pairplot(mydata, hue='BAD', diag_kind='hist')
# groupby BAD description for numerical features
mydata.groupby(['BAD']).describe().T
# whole data description for numerical features
mydata.describe().T
From the observations:
# groupby description for categorical values
mydata.groupby(['BAD']).describe(exclude = 'number').T
# whole data description for categorical values
mydata.describe(exclude='number').T
# grouped by JOB description
pd.options.display.max_rows = 72
mydata.groupby(['JOB']).describe().T
# on JOB-BAD
sns.set_style("whitegrid")
plt.figure(figsize= (18,10))
plt.subplot(2,3,1)
sns.countplot(data = mydata, x='JOB')
plt.subplot(2,3,2)
sns.countplot(data = mydata, x='JOB' , hue = 'BAD')
plt.subplot(2,3,3)
sns.barplot(data = mydata, x='JOB', y='BAD')
# on JOB-LOAN-BAD
plt.figure(figsize= (13,10))
plt.subplot(2,2,1)
sns.barplot(data = mydata, x='JOB', y= 'LOAN')
plt.subplot(2,2,2)
sns.barplot(data = mydata, x='JOB', y= 'LOAN' , hue = 'BAD')
# on JOB-DELINQ-BAD
plt.figure(figsize= (13,10))
plt.subplot(2,2,1)
sns.barplot(data = mydata, x='JOB', y= 'DELINQ' )
plt.subplot(2,2,2)
sns.barplot(data = mydata, x='JOB', y= 'DELINQ' , hue = 'BAD')
# on JOB-NINQ-BAD
plt.figure(figsize= (13,10))
plt.subplot(2,2,1)
sns.barplot(data = mydata, x='JOB', y= 'NINQ')
plt.subplot(2,2,2)
sns.barplot(data = mydata, x='JOB', y= 'NINQ' , hue = 'BAD')
# on JOB-MORTDUE-BAD
plt.figure(figsize= (13,10))
plt.subplot(2,2,1)
sns.barplot(data = mydata, x='JOB', y= 'MORTDUE')
plt.subplot(2,2,2)
sns.barplot(data = mydata, x='JOB', y= 'MORTDUE' , hue = 'BAD')
# on JOB-CLAGE-BAD
plt.figure(figsize= (13,10))
plt.subplot(2,2,1)
sns.barplot(data = mydata, x='JOB', y= 'CLAGE')
plt.subplot(2,2,2)
sns.barplot(data = mydata, x='JOB', y= 'CLAGE' , hue = 'BAD')
# defining cramers v function to see the association between two categorical features
def cramers_v(x, y):
    import scipy.stats as ss
    confusion_matrix = pd.crosstab(x, y)
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    return np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))
# cramers v value between REASON & BAD
cramers_v(mydata['REASON'], mydata['BAD'])
# cramers v value between JOB & BAD
cramers_v(mydata['JOB'], mydata['BAD'])
In terms of Cramer's V:
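As a sanity check on the bias-corrected Cramer's V, a perfectly associated pair should score close to 1 and an exactly independent pair should score 0. The sketch below re-states the same formula on synthetic series so it can run on its own:

```python
import numpy as np
import pandas as pd
import scipy.stats as ss

def cramers_v(x, y):
    # bias-corrected Cramer's V (same formula as the helper above)
    confusion_matrix = pd.crosstab(x, y)
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)
    return np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))

a = pd.Series([0, 1] * 100)
v_assoc = cramers_v(a, a)                              # perfect association -> near 1
v_indep = cramers_v(a, pd.Series([0, 0, 1, 1] * 50))   # independent -> 0.0
print(v_assoc, v_indep)
```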
# defining correlation ratio function to see the association between numerical-categorical features
def correlation_ratio(categories, measurements):
    fcat, _ = pd.factorize(categories)
    cat_num = np.max(fcat) + 1
    y_avg_array = np.zeros(cat_num)
    n_array = np.zeros(cat_num)
    for i in range(0, cat_num):
        cat_measures = measurements[np.argwhere(fcat == i).flatten()]
        n_array[i] = len(cat_measures)
        y_avg_array[i] = np.average(cat_measures)
    y_total_avg = np.sum(np.multiply(y_avg_array, n_array)) / np.sum(n_array)
    numerator = np.sum(np.multiply(n_array, np.power(np.subtract(y_avg_array, y_total_avg), 2)))
    denominator = np.sum(np.power(np.subtract(measurements, y_total_avg), 2))
    if numerator == 0:
        eta = 0.0
    else:
        eta = np.sqrt(numerator / denominator)
    return eta
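The correlation ratio (eta) can be sanity-checked on two extreme cases: values fully determined by the group give eta = 1, identical group distributions give eta = 0. The compact re-implementation below (not the notebook's exact function) exists only so the check is self-contained:

```python
import numpy as np
import pandas as pd

def correlation_ratio(categories, measurements):
    # eta: between-group variance over total variance, as defined above
    fcat, _ = pd.factorize(categories)
    measurements = np.asarray(measurements, dtype=float)
    groups = [measurements[fcat == i] for i in range(fcat.max() + 1)]
    y_total = measurements.mean()
    numerator = sum(len(g) * (g.mean() - y_total) ** 2 for g in groups)
    denominator = ((measurements - y_total) ** 2).sum()
    return 0.0 if numerator == 0 else np.sqrt(numerator / denominator)

# group means fully determine the values -> eta == 1
eta_perfect = correlation_ratio(['a', 'a', 'a', 'b', 'b', 'b'], [1, 1, 1, 5, 5, 5])
# identical group distributions -> eta == 0
eta_none = correlation_ratio(['a', 'a', 'b', 'b'], [1, 5, 1, 5])
print(eta_perfect, eta_none)
```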
from scipy.stats import shapiro, skew, kurtosis
for i in mydata.drop('BAD', axis=1).select_dtypes(exclude='object').columns:
    print(i)
    print('Correlation Ratio: {}'.format(correlation_ratio(mydata['BAD'], mydata[i])))
    if shapiro(mydata[i])[1] < 0.05:
        print('Non-Normal Distribution')
    else:
        print('Normal Distribution')
    print('Skewness: {}, Kurtosis: {}'.format(skew(mydata[i]), kurtosis(mydata[i])))
    print('\n')
In terms of Correlation Ratio:
In terms of Shapiro value:
# plotting pearson correlation heatmap
plt.figure(figsize=(8, 10))
sns.heatmap(mydata.corr(numeric_only=True), annot=True)
In terms of Pearson Correlation:
# plotting spearman correlation heatmap
plt.figure(figsize=(8, 10))
sns.heatmap(mydata.corr(method='spearman', numeric_only=True), annot=True)
In terms of Spearman Correlation:
## A/B testing for categorical-categorical columns
from scipy.stats import chi2_contingency
chi2_check = []
categorical_columns = mydata.select_dtypes('object').columns
for i in categorical_columns:
    if chi2_contingency(pd.crosstab(mydata['BAD'], mydata[i]))[1] < 0.05:
        chi2_check.append('Accept H1')
    else:
        chi2_check.append('Accept H0')
chi = pd.DataFrame(data = [categorical_columns, chi2_check]).T
chi.columns = ['Column', 'Hypothesis']
chi
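To make the decision rule concrete, here is a worked `chi2_contingency` call on a small hypothetical 2x2 table (counts invented for illustration):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# hypothetical 2x2 table: rows = BAD (0/1), columns = a binary category
table = pd.DataFrame([[10, 20], [20, 10]])
chi2, p, dof, expected = chi2_contingency(table)
print(p)  # p < 0.05 here, so this column would get 'Accept H1'
```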
In terms of Chi-Square test:
## A/B testing for categories within categorical columns
check = {}
for i in chi[chi['Hypothesis'] == 'Accept H1']['Column']:
    dummies = pd.get_dummies(mydata[i])
    bon_p_value = 0.05 / mydata[i].nunique()
    for series in dummies:
        if chi2_contingency(pd.crosstab(mydata['BAD'], dummies[series]))[1] < bon_p_value:
            check['{}_{}'.format(i, series)] = 'Accept H1'
        else:
            check['{}_{}'.format(i, series)] = 'Accept H0'
res_chi = pd.DataFrame(data = [check.keys(), check.values()]).T
res_chi.columns = ['Pair', 'Hypothesis']
res_chi
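The Bonferroni correction used above simply divides the significance level by the number of category-level tests for a column; e.g. for a column with 6 levels (the count is illustrative):

```python
alpha = 0.05
m = 6  # number of category-level tests for one column, e.g. a 6-level JOB column
bon_p_value = alpha / m
print(bon_p_value)

# with this threshold, the family-wise error rate of m independent tests stays below alpha
fwer = 1 - (1 - bon_p_value) ** m
print(fwer)
```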
In terms of Chi-Square test:
# picking only the ones that Accepts H1
res_chi[res_chi['Hypothesis'] == 'Accept H1']
# plotting only the ones that Accepts H1
for i in res_chi[res_chi['Hypothesis'] == 'Accept H1']['Pair']:
    col, cat = i.split('_')[0], i.split('_')[1]
    sns.countplot(x=mydata[mydata[col] == cat]['BAD'])
    plt.title(i)
    plt.show()
## A/B testing for categorical-continuous columns
from scipy.stats import mannwhitneyu
mann = []
for i in mydata.drop('BAD', axis=1).select_dtypes('number').columns:
    if mannwhitneyu(mydata[mydata['BAD'] == 0][i],
                    mydata[mydata['BAD'] == 1][i])[1] < 0.05:
        mann.append('Accept H1')
    else:
        mann.append('Accept H0')
res = pd.DataFrame(data = [list(mydata.drop('BAD', axis=1).select_dtypes('number').columns), mann]).T
res.columns = ['Columns', 'Hypothesis']
res
In terms of Mann-Whitney U test:
# picking only the ones that Accepts H1
res[res['Hypothesis'] == 'Accept H1']
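The same decision rule can be illustrated on synthetic data: two samples drawn with a location shift should reject the null, mirroring the BAD == 0 vs BAD == 1 comparison (seeded for reproducibility):

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
group0 = rng.normal(loc=0.0, size=200)  # stands in for values where BAD == 0
group1 = rng.normal(loc=1.0, size=200)  # stands in for values where BAD == 1, shifted

stat, p = mannwhitneyu(group0, group1)
print(p < 0.05)  # True: the shift is detected, i.e. 'Accept H1'
```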
# creating dummy variables so all the data are numerical
mydummy = pd.get_dummies(data= mydata, drop_first= True, columns = ['REASON','JOB'])
mydummy.columns
# selecting features
IV = ['LOAN', 'MORTDUE', 'VALUE', 'YOJ', 'DELINQ', 'CLAGE', 'NINQ',
'CLNO', 'REASON_HomeImp', 'JOB_Office', 'JOB_Other', 'JOB_ProfExe',
'JOB_Sales', 'JOB_Self']
# independent variables
x = mydummy[IV]
# dependent/target variable
y = mydummy['BAD']
# first, let's use a boxplot to show how the data is distributed
plt.figure(figsize = (8,8))
sns.boxplot(data = mydummy)
plt.xticks(rotation = 90)
# creating dummy copy
from sklearn.preprocessing import StandardScaler
dummy = mydummy.copy()
# rescaling the data
scaler = StandardScaler()
dummy = scaler.fit_transform(dummy)
dummy = pd.DataFrame(dummy, columns = mydummy.columns)
dummy = dummy.drop('BAD', axis=1)
dummy.head()
# after rescaling, let's use a boxplot to show how the data is now distributed
plt.figure(figsize = (8,8))
sns.boxplot(data = dummy)
plt.xticks(rotation = 90)
# let's use PCA to reduce our 14 features to 10 components
from sklearn.decomposition import PCA
pca = PCA(n_components=10, random_state=101)
pca.fit(dummy)
x_pca = pca.transform(dummy)
# displaying PCA columns
x_pca = pd.DataFrame(x_pca, columns = ['PC1','PC2','PC3','PC4','PC5','PC6','PC7','PC8','PC9','PC10'])
x_pca
# describing the explanation ratio of each PC line
pca.explained_variance_ratio_
From the PCA explained variance ratios, we can see that:
# sum the total pca explained variance ratio (in %)
sum(pca.explained_variance_ratio_)
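A common companion to the per-component ratios is the cumulative curve, which shows how many components are needed for a given share of variance. Since it only relies on `explained_variance_ratio_`, the sketch below uses random stand-in data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(101)
X = rng.normal(size=(200, 14))  # stand-in for the 14 scaled features

pca = PCA(n_components=10, random_state=101).fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(cumulative)  # non-decreasing; the last entry equals the summed ratio
```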
# inserting the BAD column into the PCA columns
x_pca['BAD'] = mydummy['BAD']
# inserting a new column based on HDBSCAN labels
from hdbscan import HDBSCAN
hdbscan = HDBSCAN(min_cluster_size = 2, min_samples = 25)
hdbscan.fit(x_pca[['PC1','PC2','PC3','PC4','PC5','PC6','PC7','PC8','PC9','PC10']])
x_pca['HDBScan'] = hdbscan.labels_
# identifying the numbers of clusters
n_clusters = len(set(hdbscan.labels_)) - (1 if -1 in hdbscan.labels_ else 0)
n_clusters
# identifying the numbers of noise/outliers
n_noise = list(hdbscan.labels_).count(-1)
n_noise
# plotting the minimum spanning tree
hdbscan = HDBSCAN(min_cluster_size=3, gen_min_span_tree=True)
hdbscan.fit(x_pca[['PC1','PC2','PC3','PC4','PC5','PC6','PC7','PC8','PC9','PC10']])
plt.figure(figsize=(12,8))
hdbscan.minimum_spanning_tree_.plot(edge_cmap='viridis',
edge_alpha=0.6,
node_size=80,
edge_linewidth=2)
# plotting the clusters
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(15, 6))
ax1.set_title('Original')
ax1.scatter(x_pca['PC1'], x_pca['PC2'])
ax2.set_title('HDBSCAN')
ax2.scatter(x_pca['PC1'], x_pca['PC2'], c=x_pca['HDBScan'], cmap='rainbow')
# pairplotting the whole data
df = mydata.copy()
df['HDBSCAN LABEL'] = x_pca['HDBScan']
sns.pairplot(df, hue = 'BAD', diag_kind = 'hist')
# showing the grouped by clusters description (outlier = -1)
pd.options.display.max_rows = 121
df.groupby('HDBSCAN LABEL').describe().T
df.groupby('HDBSCAN LABEL').describe(exclude='number').T
From the observations:
- highest mean loan amount requested (LOAN) of USD 45,143
- highest mean amount due on the existing mortgage (MORTDUE) of USD 224,648
- highest mean current property value (VALUE) of USD 307,784
- highest mean number of delinquent credit lines (DELINQ) of 2.5
- highest mean age of the oldest credit/trade line (CLAGE) of 283.6 months
- highest mean number of recent credit inquiries (NINQ) of 2.7
- highest share of the "Self" JOB category
# splitting into training and test sets (80% : 20%)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state= 101)
# resampling the training set with SMOTE into x_trainres & y_trainres
from imblearn.over_sampling import SMOTE
from collections import Counter
y_train = y_train.astype('int')
smo = SMOTE(random_state=0, sampling_strategy='minority')
x_trainres, y_trainres = smo.fit_resample(x_train, y_train)
print(sorted(Counter(y_trainres).items()))
# independent feature train shape
x_train.shape
# independent feature resampled train shape
x_trainres.shape
# model fitting for normal data
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier()
dtree.fit(x_train,y_train)
# model fitting for oversampled data
dtree2 = DecisionTreeClassifier()
dtree2.fit(x_trainres,y_trainres)
# classification report for normal data
from sklearn.metrics import classification_report,confusion_matrix
dtree_pred = dtree.predict(x_test)
dtree_predprob = dtree.predict_proba(x_test)
print(classification_report(y_test, dtree_pred))
# classification report for oversampled data
dtree_pred2 = dtree2.predict(x_test)
dtree_predprob2 = dtree2.predict_proba(x_test)
print(classification_report(y_test, dtree_pred2))
# confusion matrix for normal data
cnf_matrix = confusion_matrix(y_test, dtree_pred)
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, fmt='g')
plt.tight_layout()
plt.title('Confusion Matrix - Normal')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()
# confusion matrix for oversampled data
cnf_matrix = confusion_matrix(y_test, dtree_pred2)
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, fmt='g')
plt.tight_layout()
plt.title('Confusion Matrix - Oversampled')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()
from sklearn import metrics
# ROC - AUC Score for normal data
dtree_pred_proba = dtree.predict_proba(x_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, dtree_pred_proba)
auc = metrics.roc_auc_score(y_test, dtree_pred_proba)
roc_auc = metrics.auc(fpr, tpr)
plt.figure(figsize= (10,5))
plt.title('Receiver Operator Characteristic - Normal')
plt.plot(fpr, tpr, 'b', label='Decision Tree, AUC = {}'.format(round(auc, 3)))
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.legend(loc='lower right')
plt.show()
# ROC - AUC Score for oversampled data
dtree_pred_proba2 = dtree2.predict_proba(x_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, dtree_pred_proba2)
auc = metrics.roc_auc_score(y_test, dtree_pred_proba2)
roc_auc = metrics.auc(fpr, tpr)
plt.figure(figsize= (10,5))
plt.title('Receiver Operator Characteristic - Oversampled')
plt.plot(fpr, tpr, 'b', label='Decision Tree, AUC = {}'.format(round(auc, 3)))
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.legend(loc='lower right')
plt.show()
from sklearn.model_selection import cross_val_score
# Cross Validation score for normal data
dtreescores = cross_val_score(estimator=dtree, X=x_train, y=y_train,
                              cv=10, n_jobs=1, scoring='roc_auc')
print('Cross validation - Normal Data scores: {}'.format(dtreescores))
plt.title('Cross Validation - Normal Data')
plt.scatter(np.arange(len(dtreescores)), dtreescores)
plt.axhline(y=np.mean(dtreescores), color='g') # Mean value of cross validation scores
plt.show()
# Cross Validation score for oversampled data
dtreescores2 = cross_val_score(estimator=dtree2, X=x_trainres, y=y_trainres,
                               cv=10, n_jobs=1, scoring='roc_auc')
print('Cross Validation - Oversampled Data scores: {}'.format(dtreescores2))
plt.title('Cross Validation - Oversampled Data')
plt.scatter(np.arange(len(dtreescores2)), dtreescores2)
plt.axhline(y=np.mean(dtreescores2), color='g') # Mean value of cross validation scores
plt.show()
from IPython.display import Image
from io import StringIO
from sklearn.tree import export_graphviz
import pydot
dot_data = StringIO()
export_graphviz(dtree, out_file = dot_data, feature_names = x_train.columns, filled=True, rounded=True, special_characters=True)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
Image(graph[0].create_png())
dot_data = StringIO()
export_graphviz(dtree2, out_file = dot_data, feature_names = x_trainres.columns, filled=True, rounded=True, special_characters=True)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
Image(graph[0].create_png())
from eli5 import show_weights
from eli5.sklearn import PermutationImportance
# permutation importance for normal data
dtreeperm = PermutationImportance(dtree, scoring = 'roc_auc', random_state= 101).fit(x_test, y_test)
show_weights(dtreeperm, feature_names = list(x_test.columns))
# permutation importance for oversampled data
dtreeperm2 = PermutationImportance(dtree2, scoring = 'roc_auc', random_state= 101).fit(x_test, y_test)
show_weights(dtreeperm2, feature_names = list(x_test.columns))
# model fitting for normal data
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(random_state= 190, n_estimators=1000)
rfc.fit(x_train, y_train)
# model fitting for oversampled data
from sklearn.ensemble import RandomForestClassifier
rfc2 = RandomForestClassifier(random_state= 190, n_estimators=1000)
rfc2.fit(x_trainres, y_trainres)
# classification report for normal data
rfc_pred = rfc.predict(x_test)
rfc_predprob = rfc.predict_proba(x_test)
print(classification_report(y_test, rfc_pred))
# classification report for oversampled data
rfc_pred2 = rfc2.predict(x_test)
rfc_predprob2 = rfc2.predict_proba(x_test)
print(classification_report(y_test, rfc_pred2))
# confusion matrix for normal data
cnf_matrix = confusion_matrix(y_test, rfc_pred)
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, fmt='g')
plt.tight_layout()
plt.title('Confusion Matrix - Normal')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()
# confusion matrix for oversampled data
cnf_matrix = confusion_matrix(y_test, rfc_pred2)
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, fmt='g')
plt.tight_layout()
plt.title('Confusion Matrix - Oversampled')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()
# ROC - AUC Score for normal data
rfc_pred_proba = rfc.predict_proba(x_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, rfc_pred_proba)
auc = metrics.roc_auc_score(y_test, rfc_pred_proba)
roc_auc = metrics.auc(fpr, tpr)
plt.figure(figsize= (10,5))
plt.title('Receiver Operator Characteristic - Normal')
plt.plot(fpr, tpr, 'b', label='Random Forest, AUC = {}'.format(round(auc, 3)))
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.legend(loc='lower right')
plt.show()
# ROC - AUC Score for oversampled data
rfc_pred_proba2 = rfc2.predict_proba(x_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, rfc_pred_proba2)
auc = metrics.roc_auc_score(y_test, rfc_pred_proba2)
roc_auc = metrics.auc(fpr, tpr)
plt.figure(figsize= (10,5))
plt.title('Receiver Operator Characteristic - Oversampled')
plt.plot(fpr, tpr, 'b', label='Random Forest, AUC = {}'.format(round(auc, 3)))
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.legend(loc='lower right')
plt.show()
# Cross Validation score for normal data
rfcscores = cross_val_score(estimator=rfc, X=x_train, y=y_train,
                            cv=10, n_jobs=1, scoring='roc_auc')
print('Cross validation - Normal Data scores: {}'.format(rfcscores))
plt.title('Cross Validation - Normal Data')
plt.scatter(np.arange(len(rfcscores)), rfcscores)
plt.axhline(y=np.mean(rfcscores), color='g') # Mean value of cross validation scores
plt.show()
# Cross Validation score for oversampled data
rfcscores2 = cross_val_score(estimator=rfc2, X=x_trainres, y=y_trainres,
                             cv=10, n_jobs=1, scoring='roc_auc')
print('Cross Validation - Oversampled Data scores: {}'.format(rfcscores2))
plt.title('Cross Validation - Oversampled Data')
plt.scatter(np.arange(len(rfcscores2)), rfcscores2)
plt.axhline(y=np.mean(rfcscores2), color='g') # Mean value of cross validation scores
plt.show()
# Feature Importance for normal data
rfc_coef1 = pd.Series(rfc.feature_importances_, x_train.columns).sort_values(ascending= False)
rfc_coef1.plot(kind = 'bar', title='Feature Importances - Normal Data')
plt.show()
# Feature Importance for oversampled data
rfc_coef2 = pd.Series(rfc2.feature_importances_, x_trainres.columns).sort_values(ascending= False)
rfc_coef2.plot(kind = 'bar', title='Feature Importances - Oversampled Data')
plt.show()
# permutation importance for normal data
rfcperm = PermutationImportance(rfc, scoring = 'roc_auc', random_state= 101).fit(x_test, y_test)
show_weights(rfcperm, feature_names = list(x_test.columns))
# permutation importance for oversampled data
rfcperm2 = PermutationImportance(rfc2, scoring = 'roc_auc', random_state= 101).fit(x_test, y_test)
show_weights(rfcperm2, feature_names = list(x_test.columns))
# model fitting for normal data
from xgboost import XGBClassifier
xgb = XGBClassifier(learning_rate = 0.01, n_estimators = 10000, max_depth = None, n_jobs = -1)
xgb.fit(x_train, y_train)
# model fitting for oversampled data
xgb2 = XGBClassifier(learning_rate = 0.01, n_estimators = 10000, max_depth = None, n_jobs = -1)
xgb2.fit(x_trainres, y_trainres)
# classification report for normal data
xgb_pred = xgb.predict(x_test)
xgb_predprob = xgb.predict_proba(x_test)
print(classification_report(y_test, xgb_pred))
# classification report for oversampled data
xgb_pred2 = xgb2.predict(x_test)
xgb_predprob2 = xgb2.predict_proba(x_test)
print(classification_report(y_test, xgb_pred2))
# confusion matrix for normal data
cnf_matrix = confusion_matrix(y_test, xgb_pred)
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, fmt='g')
plt.tight_layout()
plt.title('Confusion Matrix - Normal')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()
# confusion matrix for oversampled data
cnf_matrix = confusion_matrix(y_test, xgb_pred2)
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, fmt='g')
plt.tight_layout()
plt.title('Confusion Matrix - Oversampled')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()
# ROC - AUC Score for normal data
xgb_pred_proba = xgb.predict_proba(x_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, xgb_pred_proba)
auc = metrics.roc_auc_score(y_test, xgb_pred_proba)
roc_auc = metrics.auc(fpr, tpr)
plt.figure(figsize= (10,5))
plt.title('Receiver Operator Characteristic - Normal')
plt.plot(fpr, tpr, 'b', label='XGBoost, AUC = {}'.format(round(auc, 3)))
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.legend(loc='lower right')
plt.show()
# ROC - AUC Score for oversampled data
xgb_pred_proba2 = xgb2.predict_proba(x_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, xgb_pred_proba2)
auc = metrics.roc_auc_score(y_test, xgb_pred_proba2)
roc_auc = metrics.auc(fpr, tpr)
plt.figure(figsize= (10,5))
plt.title('Receiver Operator Characteristic - Oversampled')
plt.plot(fpr, tpr, 'b', label='XGBoost, AUC = {}'.format(round(auc, 3)))
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.legend(loc='lower right')
plt.show()
# Cross Validation score for normal data
xgbscores = cross_val_score(estimator=xgb, X=x_train, y=y_train,
                            cv=10, n_jobs=1, scoring='roc_auc')
print('Cross validation - Normal Data scores: {}'.format(xgbscores))
plt.title('Cross Validation - Normal Data')
plt.scatter(np.arange(len(xgbscores)), xgbscores)
plt.axhline(y=np.mean(xgbscores), color='g') # Mean value of cross validation scores
plt.show()
# Cross Validation score for oversampled data
xgbscores2 = cross_val_score(estimator=xgb2, X=x_trainres, y=y_trainres,
                             cv=10, n_jobs=1, scoring='roc_auc')
print('Cross Validation - Oversampled Data scores: {}'.format(xgbscores2))
plt.title('Cross Validation - Oversampled Data')
plt.scatter(np.arange(len(xgbscores2)), xgbscores2)
plt.axhline(y=np.mean(xgbscores2), color='g') # Mean value of cross validation scores
plt.show()
# Feature Importance for normal data
xgbcoef1 = pd.Series(xgb.feature_importances_, x_train.columns).sort_values(ascending= False)
xgbcoef1.plot(kind = 'bar', title='Feature Importances - Normal Data')
plt.show()
# Feature Importance for oversampled data
xgbcoef2 = pd.Series(xgb2.feature_importances_, x_trainres.columns).sort_values(ascending= False)
xgbcoef2.plot(kind = 'bar', title='Feature Importances - Oversampled Data')
plt.show()
# permutation importance for normal data
xgbperm = PermutationImportance(xgb, scoring = 'roc_auc', random_state= 101).fit(x_test, y_test)
show_weights(xgbperm, feature_names = list(x_test.columns))
# permutation importance for oversampled data
xgbperm2 = PermutationImportance(xgb2, scoring = 'roc_auc', random_state= 101).fit(x_test, y_test)
show_weights(xgbperm2, feature_names = list(x_test.columns))
# model fitting for normal data
from sklearn.naive_bayes import BernoulliNB
nb = BernoulliNB()
nb.fit(x_train, y_train)
# model fitting for oversampled data
nb2 = BernoulliNB()
nb2.fit(x_trainres, y_trainres)
# classification report for normal data
nb_pred = nb.predict(x_test)
nb_predprob = nb.predict_proba(x_test)
print(classification_report(y_test, nb_pred))
# classification report for oversampled data
nb_pred2 = nb2.predict(x_test)
nb_predprob2 = nb2.predict_proba(x_test)
print(classification_report(y_test, nb_pred2))
# confusion matrix for normal data
cnf_matrix = confusion_matrix(y_test, nb_pred)
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, fmt='g')
plt.tight_layout()
plt.title('Confusion Matrix - Normal')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()
# confusion matrix for oversampled data
cnf_matrix = confusion_matrix(y_test, nb_pred2)
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, fmt='g')
plt.tight_layout()
plt.title('Confusion Matrix - Oversampled')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()
# ROC - AUC Score for normal data
nb_pred_proba = nb.predict_proba(x_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, nb_pred_proba)
auc = metrics.roc_auc_score(y_test, nb_pred_proba)
roc_auc = metrics.auc(fpr, tpr)
plt.figure(figsize= (10,5))
plt.title('Receiver Operator Characteristic - Normal')
plt.plot(fpr, tpr, 'b', label='Bernoulli NB, AUC = {}'.format(round(auc, 3)))
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.legend(loc='lower right')
plt.show()
# ROC - AUC Score for oversampled data
nb_pred_proba2 = nb2.predict_proba(x_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, nb_pred_proba2)
auc = metrics.roc_auc_score(y_test, nb_pred_proba2)
roc_auc = metrics.auc(fpr, tpr)
plt.figure(figsize= (10,5))
plt.title('Receiver Operator Characteristic - Oversampled')
plt.plot(fpr, tpr, 'b', label='Bernoulli NB, AUC = {}'.format(round(auc, 3)))
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.legend(loc='lower right')
plt.show()
# Cross Validation score for normal data
nbscores = cross_val_score(estimator=nb, X=x_train, y=y_train,
                           cv=10, n_jobs=1, scoring='roc_auc')
print('Cross validation - Normal Data scores: {}'.format(nbscores))
plt.title('Cross Validation - Normal Data')
plt.scatter(np.arange(len(nbscores)), nbscores)
plt.axhline(y=np.mean(nbscores), color='g') # Mean value of cross validation scores
plt.show()
# Cross Validation score for oversampled data
nbscores2 = cross_val_score(estimator=nb2, X=x_trainres, y=y_trainres,
                            cv=10, n_jobs=1, scoring='roc_auc')
print('Cross Validation - Oversampled Data scores: {}'.format(nbscores2))
plt.title('Cross Validation - Oversampled Data')
plt.scatter(np.arange(len(nbscores2)), nbscores2)
plt.axhline(y=np.mean(nbscores2), color='g') # Mean value of cross validation scores
plt.show()
# permutation importance for normal data
nbperm = PermutationImportance(nb, scoring = 'roc_auc', random_state= 101).fit(x_test, y_test)
show_weights(nbperm, feature_names = list(x_test.columns))
# permutation importance for oversampled data
nbperm2 = PermutationImportance(nb2, scoring = 'roc_auc', random_state= 101).fit(x_test, y_test)
show_weights(nbperm2, feature_names = list(x_test.columns))
# creating new dummy variable for scaling
mydummy2 = mydummy.copy()
mydummy2.columns
# importing standard scaler
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(mydummy2.drop('BAD',axis=1))
# scaling features
scaled_features = scaler.transform(mydummy2.drop('BAD',axis=1))
df_feat = pd.DataFrame(scaled_features,columns=mydummy2.columns[1:])
df_feat.head()
# splitting data
X_Train, X_Test, Y_Train, Y_Test = train_test_split(scaled_features,mydummy2['BAD'],
test_size=0.20, random_state= 101)
# SMOTE resampling
Y_Train = Y_Train.astype('int')
smo = SMOTE(random_state=0, sampling_strategy='minority')
X_Trainres, Y_Trainres = smo.fit_resample(X_Train, Y_Train)
print(sorted(Counter(Y_Trainres).items()))
# model fitting for normal data
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_Train, Y_Train)
# model fitting for oversampled data
from sklearn.neighbors import KNeighborsClassifier
knn2 = KNeighborsClassifier(n_neighbors=3)
knn2.fit(X_Trainres, Y_Trainres)
# classification report for normal data
knn_pred = knn.predict(X_Test)
knn_predprob = knn.predict_proba(X_Test)
print(classification_report(Y_Test, knn_pred))
# classification report for oversampled data
knn_pred2 = knn2.predict(X_Test)
knn_predprob2 = knn2.predict_proba(X_Test)
print(classification_report(Y_Test, knn_pred2))
# confusion matrix for normal data
cnf_matrix = confusion_matrix(Y_Test, knn_pred)
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, fmt='g')
plt.tight_layout()
plt.title('Confusion Matrix - Normal')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()
# confusion matrix for oversampled data
cnf_matrix = confusion_matrix(Y_Test, knn_pred2)
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, fmt='g')
plt.tight_layout()
plt.title('Confusion Matrix - Oversampled')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()
# identifying error rates
error_rate = []
for i in range(1, 40):
    knn_i = KNeighborsClassifier(n_neighbors=i)  # separate name so the fitted knn (k=3) above is kept
    knn_i.fit(X_Train, Y_Train)
    pred_i = knn_i.predict(X_Test)
    error_rate.append(np.mean(pred_i != Y_Test))
error_rate2 = []
for i in range(1, 40):
    knn2_i = KNeighborsClassifier(n_neighbors=i)
    knn2_i.fit(X_Trainres, Y_Trainres)
    pred_i2 = knn2_i.predict(X_Test)
    error_rate2.append(np.mean(pred_i2 != Y_Test))
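Once the error rates are collected, the best k can be read off programmatically. The list below is made up for illustration, since the real `error_rate` depends on the split:

```python
import numpy as np

# illustrative error rates for k = 1..8 (made-up numbers)
error_rate = [0.20, 0.15, 0.12, 0.11, 0.115, 0.12, 0.13, 0.14]

best_k = int(np.argmin(error_rate)) + 1  # +1 because k starts at 1
print(best_k)  # 4
```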
plt.figure(figsize=(10,6))
plt.plot(range(1,40),error_rate,color='blue', linestyle='dashed', marker='o',
markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value in Normal Data')
plt.xlabel('K')
plt.ylabel('Error Rate')
plt.show()
plt.figure(figsize=(10,6))
plt.plot(range(1,40),error_rate2,color='blue', linestyle='dashed', marker='o',
markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value in Oversampled Data')
plt.xlabel('K')
plt.ylabel('Error Rate')
plt.show()
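With the error-rate curves computed, the best k is simply the one that minimizes test error. A small helper along these lines (illustrative, not part of the notebook) makes the elbow-plot reading explicit:

```python
import numpy as np

def best_k(error_rate, k_start=1):
    """Return the k with the lowest error rate (ties go to the smaller k)."""
    error_rate = np.asarray(error_rate)
    return k_start + int(np.argmin(error_rate))

# toy curve: error dips at k = 3
print(best_k([0.20, 0.15, 0.10, 0.12, 0.18]))  # → 3
```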
from sklearn import metrics
# ROC - AUC score for normal data
knn_pred_proba = knn.predict_proba(X_Test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(Y_Test, knn_pred_proba)
auc = metrics.roc_auc_score(Y_Test, knn_pred_proba)
plt.figure(figsize=(10, 5))
plt.title('Receiver Operating Characteristic - Normal')
plt.plot(fpr, tpr, 'b', label='KNN, AUC = {}'.format(round(auc, 2)))
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.legend(loc='lower right')
plt.show()
# ROC - AUC score for oversampled data
knn_pred_proba2 = knn2.predict_proba(X_Test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(Y_Test, knn_pred_proba2)
auc = metrics.roc_auc_score(Y_Test, knn_pred_proba2)
plt.figure(figsize=(10, 5))
plt.title('Receiver Operating Characteristic - Oversampled')
plt.plot(fpr, tpr, 'b', label='KNN2, AUC = {}'.format(round(auc, 2)))
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.legend(loc='lower right')
plt.show()
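Note that `roc_auc_score(y, scores)` and `auc(fpr, tpr)` are two routes to the same number, which is easy to verify on a tiny example (toy labels and scores, not the notebook's data):

```python
from sklearn import metrics

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, _ = metrics.roc_curve(y_true, scores)
print(metrics.roc_auc_score(y_true, scores))  # 0.75
print(metrics.auc(fpr, tpr))                  # 0.75 as well
```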
# Cross Validation score for normal data
from sklearn.model_selection import cross_val_score
knnscores = cross_val_score(estimator=knn,
                            X=X_Train,
                            y=Y_Train,
                            cv=10,
                            n_jobs=1,
                            scoring='roc_auc')
print('Cross validation - Normal Data scores: {}'.format(knnscores))
plt.title('Cross Validation - Normal Data')
plt.scatter(np.arange(len(knnscores)), knnscores)
plt.axhline(y=np.mean(knnscores), color='g') # Mean value of cross validation scores
plt.show()
# Cross Validation score for oversampled data
knnscores2 = cross_val_score(estimator=knn2,
X=X_Trainres,
y=Y_Trainres,
cv=10,
n_jobs=1,
scoring = 'roc_auc')
print('Cross Validation - Oversampled Data scores: {}'.format(knnscores2))
plt.title('Cross Validation - Oversampled Data')
plt.scatter(np.arange(len(knnscores2)), knnscores2)
plt.axhline(y=np.mean(knnscores2), color='g') # Mean value of cross validation scores
plt.show()
# permutation importance for normal data (PermutationImportance & show_weights come from the eli5 package)
from eli5.sklearn import PermutationImportance
from eli5 import show_weights
# evaluate on the same scaled features the KNN models were trained on
knnperm = PermutationImportance(knn, scoring='roc_auc', random_state=101).fit(X_Test, Y_Test)
show_weights(knnperm, feature_names=list(x_test.columns))
# permutation importance for oversampled data
knnperm2 = PermutationImportance(knn2, scoring='roc_auc', random_state=101).fit(X_Test, Y_Test)
show_weights(knnperm2, feature_names=list(x_test.columns))
from sklearn.linear_model import LogisticRegression
# fit the model with normal data
logreg = LogisticRegression()
logreg.fit(x_train, y_train)
# fit the model with oversampled data
logreg2 = LogisticRegression()
logreg2.fit(x_trainres, y_trainres)
# classification report for normal data
from sklearn.metrics import classification_report
logreg_pred = logreg.predict(x_test)
logreg_predprob = logreg.predict_proba(x_test)
print(classification_report(y_test, logreg_pred))
# classification report for oversampled data
logreg_pred2 = logreg2.predict(x_test)
logreg_predprob2 = logreg2.predict_proba(x_test)
print(classification_report(y_test, logreg_pred2))
# confusion matrix for normal data
cnf_matrix = confusion_matrix(y_test, logreg_pred)
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, fmt='g')
plt.tight_layout()
plt.title('Confusion Matrix - Normal')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()
# confusion matrix for oversampled data
cnf_matrix = confusion_matrix(y_test, logreg_pred2)
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, fmt='g')
plt.tight_layout()
plt.title('Confusion Matrix - Oversampled')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()
from sklearn import metrics
print('Matthews Correlation Coefficient - Normal Data: {}'.format(metrics.matthews_corrcoef(y_test, logreg_pred)))
print('Matthews Correlation Coefficient - Oversampled Data: {}'.format(metrics.matthews_corrcoef(y_test, logreg_pred2)))
print('Log Loss - Normal Data: {}'.format(metrics.log_loss(y_test, logreg_predprob)))
print('Log Loss - Oversampled Data: {}'.format(metrics.log_loss(y_test, logreg_predprob2)))
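The Matthews correlation coefficient condenses the whole confusion matrix into one number in [-1, 1], using MCC = (TP·TN − FP·FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)). A quick check of that formula against sklearn on hypothetical toy labels:

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef, confusion_matrix

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]

# sklearn's confusion_matrix ravels as [TN, FP, FN, TP]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
mcc_manual = (tp * tn - fp * fn) / np.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(round(mcc_manual, 4))                         # 0.3333
print(round(matthews_corrcoef(y_true, y_pred), 4))  # 0.3333
```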
# ROC - AUC score for normal data
logreg_pred_proba = logreg.predict_proba(x_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test, logreg_pred_proba)
auc = metrics.roc_auc_score(y_test, logreg_pred_proba)
plt.figure(figsize=(10, 5))
plt.title('Receiver Operating Characteristic - Normal')
plt.plot(fpr, tpr, 'b', label='LR, AUC = {}'.format(round(auc, 2)))
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.legend(loc='lower right')
plt.show()
# ROC - AUC score for oversampled data
logreg_pred_proba2 = logreg2.predict_proba(x_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test, logreg_pred_proba2)
auc = metrics.roc_auc_score(y_test, logreg_pred_proba2)
plt.figure(figsize=(10, 5))
plt.title('Receiver Operating Characteristic - Oversampled')
plt.plot(fpr, tpr, 'b', label='LR2, AUC = {}'.format(round(auc, 2)))
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.legend(loc='lower right')
plt.show()
import statsmodels.api as sm
# model summary for normal data
logit_model = sm.Logit(y_train, sm.add_constant(x_train))
result = logit_model.fit(method = 'lbfgs')
print('##### Model Summary - Normal Data: #####')
print('\n')
print(result.summary2())
print('\n')
# model summary for oversampled data
logit_model2 = sm.Logit(y_trainres, sm.add_constant(x_trainres))
result2 = logit_model2.fit(method = 'lbfgs')
print('##### Model Summary - Oversampled Data: #####')
print('\n')
print(result2.summary2())
from sklearn.preprocessing import FunctionTransformer
transformer = FunctionTransformer(np.log1p, validate = True)
X_train = x_train.copy()
X_trainres = x_trainres.copy()
# transforming features in normal data with log1p
for i in x_train.columns:
    X_train[i] = transformer.fit_transform(np.array(X_train[i]).reshape(-1, 1)).ravel()
# transforming features in oversampled data with log1p
for j in x_trainres.columns:
    X_trainres[j] = transformer.fit_transform(np.array(X_trainres[j]).reshape(-1, 1)).ravel()
X_train.head()
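log1p compresses right-skewed monetary features (loan amounts, property values) toward a more symmetric scale while keeping zeros at zero, since log1p(0) = 0, and it is exactly invertible with expm1. A quick illustration on made-up loan-scale values:

```python
import numpy as np

values = np.array([0.0, 100.0, 10_000.0, 1_000_000.0])  # skewed, loan-like scale

transformed = np.log1p(values)   # log(1 + x); safe at x = 0
print(transformed)

# the transform is exactly invertible with expm1
assert np.allclose(np.expm1(transformed), values)
```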
# fit the model with normal transformed data
logregt = LogisticRegression()
logregt.fit(X_train, y_train)
# fit the model with oversampled transformed data
logregt2 = LogisticRegression()
logregt2.fit(X_trainres, y_trainres)
# classification report for normal transformed data
# NOTE: x_test is used untransformed below; strictly, the same log1p transform should also be applied to the test features
logregt_pred = logregt.predict(x_test)
logregt_predprob = logregt.predict_proba(x_test)
print(classification_report(y_test, logregt_pred))
# classification report for oversampled transformed data
logregt_pred2 = logregt2.predict(x_test)
logregt_predprob2 = logregt2.predict_proba(x_test)
print(classification_report(y_test, logregt_pred2))
# confusion matrix for normal transformed data
cnf_matrix = confusion_matrix(y_test, logregt_pred)
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, fmt='g')
plt.tight_layout()
plt.title('Confusion Matrix - Normal')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()
# confusion matrix for oversampled transformed data
cnf_matrix = confusion_matrix(y_test, logregt_pred2)
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, fmt='g')
plt.tight_layout()
plt.title('Confusion Matrix - Oversampled')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()
print('Matthews Correlation Coefficient - Normal Transformed Data: {}'.format(metrics.matthews_corrcoef(y_test, logregt_pred)))
print('Matthews Correlation Coefficient - Oversampled Transformed Data: {}'.format(metrics.matthews_corrcoef(y_test, logregt_pred2)))
print('Log Loss - Normal Transformed Data: {}'.format(metrics.log_loss(y_test, logregt_predprob)))
print('Log Loss - Oversampled Transformed Data: {}'.format(metrics.log_loss(y_test, logregt_predprob2)))
# ROC - AUC score for normal transformed data
logregt_pred_proba = logregt.predict_proba(x_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test, logregt_pred_proba)
auc = metrics.roc_auc_score(y_test, logregt_pred_proba)
plt.figure(figsize=(10, 5))
plt.title('Receiver Operating Characteristic - Normal & Transformed')
plt.plot(fpr, tpr, 'b', label='LRT, AUC = {}'.format(round(auc, 2)))
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.legend(loc='lower right')
plt.show()
# ROC - AUC score for oversampled transformed data
logregt_pred_proba2 = logregt2.predict_proba(x_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test, logregt_pred_proba2)
auc = metrics.roc_auc_score(y_test, logregt_pred_proba2)
plt.figure(figsize=(10, 5))
plt.title('Receiver Operating Characteristic - Oversampled & Transformed')
plt.plot(fpr, tpr, 'b', label='LRT2, AUC = {}'.format(round(auc, 2)))
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.legend(loc='lower right')
plt.show()
from sklearn.model_selection import cross_val_score
# Cross Validation score for normal data
logregtscores = cross_val_score(estimator=logregt,
X=X_train,
y=y_train,
cv=10,
n_jobs=1,
scoring = 'roc_auc')
print('Cross validation - Normal Data scores: {}'.format(logregtscores))
plt.title('Cross Validation - Normal Data')
plt.scatter(np.arange(len(logregtscores)), logregtscores)
plt.axhline(y=np.mean(logregtscores), color='g') # Mean value of cross validation scores
plt.show()
# Cross Validation score for oversampled data
logregtscores2 = cross_val_score(estimator=logregt2,
X=X_trainres,
y=y_trainres,
cv=10,
n_jobs=1,
scoring = 'roc_auc')
print('Cross Validation - Oversampled Data scores: {}'.format(logregtscores2))
plt.title('Cross Validation - Oversampled Data')
plt.scatter(np.arange(len(logregtscores2)), logregtscores2)
plt.axhline(y=np.mean(logregtscores2), color='g') # Mean value of cross validation scores
plt.show()
# model summary for normal transformed data
logit_model_t = sm.Logit(y_train, sm.add_constant(X_train))
result_t = logit_model_t.fit(method = 'lbfgs')
print('##### Model Summary - Normal Data: #####')
print('\n')
print(result_t.summary2())
print('\n')
# model summary for oversampled transformed data
logit_model_t2 = sm.Logit(y_trainres, sm.add_constant(X_trainres))
result_t2 = logit_model_t2.fit(method = 'lbfgs')
print('##### Model Summary - Oversampled Data: #####')
print('\n')
print(result_t2.summary2())
- DELINQ
- CLAGE
- NINQ
- LOAN
- VALUE
- JOB_Office
- REASON_HomeImp
# Finding the highest accuracy & recall score (trying to minimize False Negative)
list1 = [dtree_pred, dtree_pred2, rfc_pred, rfc_pred2, xgb_pred, xgb_pred2, knn_pred, knn_pred2, nb_pred, nb_pred2, logreg_pred, logreg_pred2,
logregt_pred, logregt_pred2,]
list3 = ['DT :', 'DT2 :', 'RF :', 'RF2 :', 'XGB :', 'XGB2:', 'KNN :', 'KNN2:', 'NB :', 'NB2 :',
'LR :', 'LR2 :', 'LRT :', 'LRT2:' ]
for i, j in zip(list1, list3):
    print('Accuracy Score in', j, metrics.accuracy_score(y_test, i))
    print(' Recall Score in', j, metrics.recall_score(y_test, i))
The highest overall Accuracy & Recall is achieved by KNN2 (KNeighborsClassifier - Oversampled), with an Accuracy of 0.965 and a Recall of 0.844.
- That is, the model classified about 96% of the test cases correctly and recalled 84% of all actual positives.
The second highest overall Accuracy & Recall is achieved by RF2 (RandomForestClassifier - Oversampled), with an Accuracy of 0.941 and a Recall of 0.747.
- That is, the model classified about 94% of the test cases correctly and recalled about 75% of all actual positives.
# Finding the highest ROC-AUC score
list2 = [dtree_pred_proba, dtree_pred_proba2, rfc_pred_proba, rfc_pred_proba2, xgb_pred_proba, xgb_pred_proba2, knn_pred_proba, knn_pred_proba2, nb_pred_proba, nb_pred_proba2, logreg_pred_proba, logreg_pred_proba2,
logregt_pred_proba, logregt_pred_proba2,]
list3 = ['DT :', 'DT2 :', 'RF :', 'RF2 :', 'XGB :', 'XGB2:', 'KNN :', 'KNN2:', 'NB :', 'NB2 :',
'LR :', 'LR2 :', 'LRT :', 'LRT2:' ]
for i, j in zip(list2, list3):
    print('ROC-AUC Score in', j, metrics.roc_auc_score(y_test, i))
# Finding the highest average of Cross-Validation scores (cv=10)
list1 = [dtreescores, dtreescores2, rfcscores, rfcscores2, xgbscores, xgbscores2, knnscores, knnscores2, nbscores, nbscores2, logregtscores, logregtscores2]
list3 = ['DT :', 'DT2 :', 'RF :', 'RF2 :', 'XGB :', 'XGB2:', 'KNN :', 'KNN2:', 'NB :', 'NB2 :',
'LRT :', 'LRT2:']
for i, j in zip(list1, list3):
    print('Average Cross-Validation score in', j, np.mean(i))
## RANDOMIZEDSEARCHCV
from sklearn.model_selection import RandomizedSearchCV
# number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# number of features to consider at every split
max_features = ['auto','sqrt']
# maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# minimum number of samples required to split a node
min_samples_split = [int(x) for x in np.linspace(start = 100, stop = 1000, num = 100)]
# minimum number of samples required at each leaf node
min_samples_leaf = [int(x) for x in np.linspace(1, 11, num = 10)]
# method of selecting samples for training each tree
bootstrap = [True, False]
# create the random grid
random_grid = {'n_estimators': n_estimators,
'max_features': max_features,
'max_depth': max_depth,
'min_samples_split': min_samples_split,
'min_samples_leaf': min_samples_leaf,
'bootstrap': bootstrap}
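RandomizedSearchCV does not enumerate this grid; it draws `n_iter` candidate settings at random from it. sklearn's `ParameterSampler` (the same machinery RandomizedSearchCV uses internally) shows exactly what gets drawn, here on a smaller hypothetical grid:

```python
from sklearn.model_selection import ParameterSampler

small_grid = {
    'n_estimators': [200, 600, 1000],
    'max_depth': [10, 60, None],
    'bootstrap': [True, False],
}

# 5 random candidates out of the 3 * 3 * 2 = 18 possible combinations
candidates = list(ParameterSampler(small_grid, n_iter=5, random_state=101))
print(len(candidates))   # 5 parameter dicts, each with one value per key
print(candidates[0])
```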
# making three sets of RandomizedSearch
rf_random = RandomizedSearchCV(estimator = rfc,
param_distributions = random_grid,
cv = 3, n_jobs = -1, n_iter = 10)
rf_random2 = RandomizedSearchCV(estimator = rfc,
param_distributions = random_grid,
cv = 3, n_jobs = -1, n_iter = 10)
rf_random3 = RandomizedSearchCV(estimator = rfc,
param_distributions = random_grid,
cv = 3, n_jobs = -1, n_iter = 10)
# first set
rf_random.fit(x_trainres, y_trainres)
# first set's best params
rf_random.best_params_
# second set
rf_random2.fit(x_trainres, y_trainres)
# second set's best param
rf_random2.best_params_
# third set
rf_random3.fit(x_trainres, y_trainres)
# third set's best param
rf_random3.best_params_
# resulting estimator (notebook output):
# RandomForestClassifier(..., criterion='gini', max_depth=None, max_features='auto',
#                        max_leaf_nodes=None, max_samples=None,
#                        min_impurity_decrease=0.0, min_impurity_split=None,
#                        min_samples_leaf=1, min_samples_split=2,
#                        min_weight_fraction_leaf=0.0, n_estimators=1000,
#                        n_jobs=None, oob_score=False, random_state=190,
#                        verbose=0, warm_start=False)
## GRIDSEARCHCV
from sklearn.model_selection import GridSearchCV
# using rfc2, our best & most consistent model, as the estimator
# the parameter grid is built around the best hyperparameters found for the previous model and its randomized searches
# since we focus on avoiding False Negatives, recall is used as the scoring metric
grid = GridSearchCV(estimator = rfc2,
                    param_grid = {
                        'n_estimators': [1000, 1800],
                        'bootstrap': [True, False],
                        'max_features': ['sqrt', 'auto'],
                        'max_depth': [80, None],
                    },
                    scoring = 'recall',
                    refit = True,
                    cv = 5, n_jobs = -1)
grid.fit(x_trainres, y_trainres)
# best score
grid.best_score_
# best parameters
grid.best_params_
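Unlike the randomized search, GridSearchCV fits every combination: the grid above spans 2 × 2 × 2 × 2 = 16 settings, and with cv=5 that means 80 model fits. `ParameterGrid` makes the count easy to verify:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'n_estimators': [1000, 1800],
    'bootstrap': [True, False],
    'max_features': ['sqrt', 'auto'],
    'max_depth': [80, None],
}

n_settings = len(ParameterGrid(param_grid))
print(n_settings)        # 16 parameter combinations
print(n_settings * 5)    # 80 model fits with cv=5
```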
# creating & fitting a new RFC model with the best parameters from GridSearch
from sklearn.ensemble import RandomForestClassifier
tuned = RandomForestClassifier(bootstrap= False, max_depth = 80, max_features = 'sqrt', n_estimators = 1800, random_state = 1)
tuned.fit(x_trainres, y_trainres)
print("-------------BEFORE HYPERPARAMETER TUNING-------------")
print(classification_report(y_test,rfc_pred2))
tuned_pred = tuned.predict(x_test)
tuned_predprob = tuned.predict_proba(x_test)
print("\n")
print("--------------AFTER HYPERPARAMETER TUNING-------------")
print(classification_report(y_test,tuned_pred))
print("-------------BEFORE HYPERPARAMETER TUNING-------------")
print('Accuracy Score',metrics.accuracy_score(y_test, rfc_pred2))
print(' Recall Score',metrics.recall_score(y_test, rfc_pred2))
print(' F1 Score',metrics.f1_score(y_test, rfc_pred2))
print("\n")
print("-------------AFTER HYPERPARAMETER TUNING---------------")
print('Accuracy Score',metrics.accuracy_score(y_test, tuned_pred))
print(' Recall Score',metrics.recall_score(y_test, tuned_pred))
print(' F1 Score',metrics.f1_score(y_test, tuned_pred))
# confusion matrix for after hyperparameter tuning
cnf_matrix = confusion_matrix(y_test, tuned_pred)
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, fmt='g')
plt.tight_layout()
plt.title('Confusion Matrix - Oversampled')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()
# ROC - AUC score after hyperparameter tuning
tuned_pred_proba = tuned.predict_proba(x_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test, tuned_pred_proba)
auc = metrics.roc_auc_score(y_test, tuned_pred_proba)
plt.figure(figsize=(10, 5))
plt.title('Receiver Operating Characteristic - After Hyperparameter Tuning')
plt.plot(fpr, tpr, 'b', label='Tuned RF, AUC = {}'.format(round(auc, 2)))
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.legend(loc='lower right')
plt.show()
# Cross Validation score for tuned & oversampled data
tunedscores = cross_val_score(estimator=tuned,
X=x_trainres,
y=y_trainres,
cv=10,
n_jobs=1,
scoring = 'roc_auc')
print('Cross Validation - Oversampled Data scores: {}'.format(tunedscores))
plt.title('Cross Validation - Oversampled Data')
plt.scatter(np.arange(len(tunedscores)), tunedscores)
plt.axhline(y=np.mean(tunedscores), color='g') # Mean value of cross validation scores
plt.show()
print('Average value of cross validation scores: ',np.mean(tunedscores))
from sklearn.model_selection import learning_curve
train_sizes, train_scores, test_scores = learning_curve(estimator=tuned,
X=x,
y=y,
train_sizes=np.linspace(0.5, 1.0, 5),
cv=10)
# Mean value of accuracy against training data
train_mean = np.mean(train_scores, axis=1)
print('train mean: ')
print(train_mean)
# Standard deviation of training accuracy per number of training samples
train_std = np.std(train_scores, axis=1)
# Same as above for test data
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)
print('test mean: ')
print(test_mean)
# Plot training accuracies
plt.plot(train_sizes, train_mean, color='red', marker='o', label='Training Accuracy')
# Plot the variance of training accuracies
plt.fill_between(train_sizes,
train_mean + train_std,
train_mean - train_std,
alpha=0.15, color='red')
# Plot for test data as training data
plt.plot(train_sizes, test_mean, color='blue', linestyle='--', marker='s',
label='Test Accuracy')
plt.fill_between(train_sizes,
test_mean + test_std,
test_mean - test_std,
alpha=0.15, color='blue')
plt.xlabel('Number of training samples')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
# let's benchmark our tuned model performance with AutoML
from tpot import TPOTClassifier
tpot = TPOTClassifier(subsample = 0.8, verbosity = 2, warm_start=True, early_stop=20, max_time_mins= 60, n_jobs= -2)
# fitting TPOT to our data
tpot.fit(x_train, y_train)
# exporting our model results
tpot.export('tpot_LOAN_DEFAULT.py')
# using our AutoML model ('tpot_LOAN_DEFAULT.py') to normal & oversampled data
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures
from sklearn.tree import DecisionTreeClassifier
from tpot.builtins import StackingEstimator
# Average CV score on the training set was: 0.9353463587921848
# fitting the AutoML for the normal data
exported_pipeline = make_pipeline(
PolynomialFeatures(degree=2, include_bias=False, interaction_only=False),
StackingEstimator(estimator=ExtraTreesClassifier(bootstrap=False, criterion="gini", max_features=0.55, min_samples_leaf=2, min_samples_split=5, n_estimators=100)),
MinMaxScaler(),
StackingEstimator(estimator=BernoulliNB(alpha=0.001, fit_prior=True)),
DecisionTreeClassifier(criterion="entropy", max_depth=5, min_samples_leaf=12, min_samples_split=9))
exported_pipeline.fit(x_train, y_train)
automl = exported_pipeline.predict(x_test)
# fitting the AutoML for the oversampled data
exported_pipeline2 = make_pipeline(
PolynomialFeatures(degree=2, include_bias=False, interaction_only=False),
StackingEstimator(estimator=ExtraTreesClassifier(bootstrap=False, criterion="gini", max_features=0.55, min_samples_leaf=2, min_samples_split=5, n_estimators=100)),
MinMaxScaler(),
StackingEstimator(estimator=BernoulliNB(alpha=0.001, fit_prior=True)),
DecisionTreeClassifier(criterion="entropy", max_depth=5, min_samples_leaf=12, min_samples_split=9))
exported_pipeline2.fit(x_trainres, y_trainres)
automl2 = exported_pipeline2.predict(x_test)
# classification report for autoML & normal data
from sklearn.metrics import classification_report,confusion_matrix
print(classification_report(y_test, automl))
# classification report for autoML & oversampled data
print(classification_report(y_test, automl2))
from sklearn import metrics
# ROC - AUC score for AutoML on normal data
automl_pred_proba = exported_pipeline.predict_proba(x_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test, automl_pred_proba)
auc = metrics.roc_auc_score(y_test, automl_pred_proba)
plt.figure(figsize=(10, 5))
plt.title('Receiver Operating Characteristic - AutoML & Normal Data')
plt.plot(fpr, tpr, 'b', label='AutoML, AUC = {}'.format(round(auc, 2)))
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.legend(loc='lower right')
plt.show()
# ROC - AUC score for AutoML on oversampled data
automl_pred_proba2 = exported_pipeline2.predict_proba(x_test)[:, 1]
fpr, tpr, _ = metrics.roc_curve(y_test, automl_pred_proba2)
auc = metrics.roc_auc_score(y_test, automl_pred_proba2)
plt.figure(figsize=(10, 5))
plt.title('Receiver Operating Characteristic - AutoML & Oversampled Data')
plt.plot(fpr, tpr, 'b', label='AutoML2, AUC = {}'.format(round(auc, 2)))
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.legend(loc='lower right')
plt.show()
From the observations above, the tuned model still outperforms the AutoML pipelines, so we refit it on the full dataset for final use.
# fit our tuned model to the whole of original dataframe
tuned.fit(x,y)
# prediction results (note: evaluated on the same data the model was just fit to, so these scores are optimistic)
from sklearn.metrics import classification_report,confusion_matrix
all_pred = tuned.predict(x)
all_predprob = tuned.predict_proba(x)
print(classification_report(y, all_pred))
# saving algorithm for further usage
import pickle
filename = 'hmeq_loan_default_tuned.sav'
pickle.dump(tuned, open(filename, 'wb'))
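Saving with pickle is only useful if the reloaded model reproduces the same predictions. A quick roundtrip check on a small stand-in model (synthetic data and an in-memory pickle, for illustration; the notebook writes to a .sav file instead):

```python
import pickle
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# tiny synthetic classification problem as a stand-in for the HMEQ features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] > 0).astype(int)

model = RandomForestClassifier(n_estimators=20, random_state=1).fit(X, y)

blob = pickle.dumps(model)       # equivalent to pickle.dump(model, open(path, 'wb'))
restored = pickle.loads(blob)

# the restored model must give identical predictions
assert np.array_equal(model.predict(X), restored.predict(X))
print('roundtrip OK')
```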
Across all the tests and observations, three features keep showing up: DELINQ, CLAGE, and NINQ, with DELINQ the most influential. These features appear to have the strongest association with the target variable (BAD) and may prove important for further tests and observations.
There is a cluster of 37 outlying datapoints with an alarmingly high rate of loan default. Relative to other customers, the bank should treat this type of customer with extra caution.
Overall, the best model for our prediction is the hyperparameter-tuned RandomForestClassifier (Oversampled). It even outperforms the AutoML model, especially on Recall, which matters because we are trying to minimize False Negatives.
There is an indication that the model's predictions could improve with more data; further tests and observations should use a larger dataset.